manhattan_rides_df <- read_csv("manhattan_rides.csv")

manhattan_rides_df <-
  manhattan_rides_df %>% 
  mutate(
    day_of_week = factor(day_of_week, ordered = T, 
                         levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")), 
    year = factor(year), 
    age_group = factor(age_group, ordered = T,
                       levels = c("18-25","26-35", "36-45", "46-55", "56-65", "66-85"))
  )

manhattan_rides_df %>% 
  group_by(age_group) %>% 
  summarize(min = min(age), max = max(age), obs = n())
## # A tibble: 6 x 4
##   age_group   min   max    obs
##   <ord>     <dbl> <dbl>  <int>
## 1 18-25        18    25  35068
## 2 26-35        26    35 102948
## 3 36-45        36    45  56694
## 4 46-55        46    55  43430
## 5 56-65        56    65  26232
## 6 66-85        66    85   6734

Exporatory Data Analysis

manhattan_rides_df %>% 
  group_by(day_of_week, year) %>% 
  summarize(obs = n()) %>% 
  ggplot(aes(x = day_of_week, y = obs, group = year, color = year)) +
  geom_point() + 
  geom_line()

Fewer rides during the week in 2020 (presumably because of WFH), but more rides on the weekends (presumably because people avoid subway/ ubers)

manhattan_rides_df %>% 
  group_by(start_date, year) %>% 
  summarize(obs = n()) %>% 
  ggplot(aes(x = start_date, y = obs, group = year, color = year)) +
  geom_line() + 
  geom_smooth(se = FALSE)

Not that helpful, but not a meaningful difference in numbers of rides between 2019 and 2020 except maybe March/ April where there appears to be a slight dip

manhattan_rides_df %>% 
  group_by(year) %>% 
  mutate(
    month = month(starttime, label = T)
  ) %>% 
  filter(tripduration < 1500) %>% 
  ggplot(aes(x = month, y = tripduration)) + 
  geom_boxplot() +
  facet_grid(~year)

#Plotly Version
manhattan_rides_df %>% 
  group_by(year) %>% 
  mutate(
    month = month(starttime, label = T)
  ) %>% 
  filter(tripduration < 2500) %>% 
  plot_ly(
    x = ~month, 
    y = ~trip_min,
    color = ~year,
    type = "box") %>% 
  layout(
    boxmode = "group",
    title = "Duration of Citibike Rides by Month",
    xaxis = list(title = "Month"),
    yaxis = list(title = "Trip Duration in Minutes")
    )

Looks like maybe the overall length of trips in 2019 was more consistent. 2020 had a bump in duration of rides, starting in April. Overall, trip length seems more variable in 2020.

manhattan_rides_df %>% 
  group_by(year) %>% 
  mutate(
    month = month(starttime, label = T)
  ) %>% 
  group_by(year, month) %>% 
  summarise(obs = n()) %>% 
  ggplot(aes(x = month, y = obs, group = year, color = year)) + 
  geom_line()

#Updated to plotly
manhattan_rides_df %>% 
  group_by(year) %>% 
  mutate(
    month = month(starttime, label = T)
  ) %>% 
  group_by(year, month) %>% 
  summarise(obs = n()) %>% 
  plot_ly(
    x = ~month, 
    y = ~obs, 
    color = ~year,
    type = "scatter",
    mode = "lines") %>%  
  layout(
    title = "Number of Citibike Rides per Month",
    xaxis = list(title = "Month"),
    yaxis = list(title = "Rides")
  )

Huge drop in monthly trips in April 2020. Lockdown started mid/late March so this coincides with people transitioning to WFH and largely staying inside to minimize contacts. The ride numbers bounce back quite a bit after this but not to 2019 levels.